Skip to content

NO-ISSUE: Restrict test retries to an allowlist of test names#31222

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mkowalski:retry-allowlist
May 29, 2026
Merged

NO-ISSUE: Restrict test retries to an allowlist of test names#31222
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mkowalski:retry-allowlist

Conversation

@mkowalski

@mkowalski mkowalski commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.
  • Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.
  • The allowlist is populated from BigQuery retry_statistics data, scoped to release 4.22, last 30 days, with a threshold of >= 5 flake occurrences. This yields 206 tests.

Behavior

Scenario Retries?
Test listed in the YAML allowlist Yes
Test NOT in the allowlist No
Empty allowlist (tests: []) No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:

tests:
  - "[sig-network] Services should be rejected for evicted pods ..."
  - "[sig-node] Pods Extended Pod Container lifecycle ..."

The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery

SELECT
  rs.TestName,
  COUNT(*) AS flake_count
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
  ON rs.JobName = jv.job_name
  AND jv.variant_name = 'Release'
  AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
  AND rs.TestName IS NOT NULL AND rs.TestName != ''
  AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC

The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

  • Tests
    • Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a single retry.
    • Added tests verifying allowlist-driven retry decisions and confirming the allowlist loader returns a valid result.
    • Retry decisions now emit debug logs for allow/deny outcomes and default to no retries if the allowlist cannot be parsed.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning May 27, 2026 12:51
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Walkthrough

The pull request refactors test retry eligibility in pkg/test/ginkgo/retries.go from complex in-function rules to an embedded YAML allowlist. RetryOnceStrategy now stores and uses an in-memory allowlist of retry-eligible test names; shouldRetryTest permits retries only when a test name is in the allowlist. Tests verify both the allowlist loader and the updated retry behavior.

Changes

Test Retry Allowlist Refactor

Layer / File(s) Summary
Allowlist infrastructure and initialization
pkg/test/ginkgo/retries.go, pkg/test/ginkgo/retry_allowed_tests.yaml
Embeds retry_allowed_tests.yaml and defines YAML schema via loadRetryAllowedTests(), which unmarshals into a map[string]bool. RetryOnceStrategy struct gains AllowedRetryTests field and initializes it in NewRetryOnceStrategy().
Retry eligibility logic refactor
pkg/test/ginkgo/retries.go
shouldRetryTest now permits retries only when test.name is present in AllowedRetryTests with debug logging; prior image-tag and binary-presence rules are removed.
Test coverage for allowlist mechanism
pkg/test/ginkgo/retries_test.go
Table-driven test (TestRetryOnceStrategy_ShouldRetryTest_Allowlist) validates retry behavior with a supplied allowlist map. New TestLoadRetryAllowedTests verifies the embedded loader returns a non-nil map.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in modified files are static, descriptive strings. No Ginkgo test titles with dynamic content (UUIDs, timestamps, generated suffixes) found.
Test Structure And Quality ✅ Passed The PR adds standard Go unit tests, not Ginkgo Describe/Context/It block tests. Custom check for Ginkgo test code is not applicable.
Microshift Test Compatibility ✅ Passed PR does not add new Ginkgo e2e tests. Changes are framework code only: refactored retry logic, unit tests using Go testing, and a YAML configuration file.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added in this PR. Changes are to test infrastructure (retry strategy) and Go unit tests, not Ginkgo e2e test declarations.
Topology-Aware Scheduling Compatibility ✅ Passed This PR only modifies test infrastructure in pkg/test/ginkgo/ (test retry eligibility logic). No deployment manifests, operator code, or controllers introducing scheduling constraints are modified.
Ote Binary Stdout Contract ✅ Passed No OTE Contract violations. YAML parsing via yaml.Unmarshal() doesn't write to stdout, and error logging uses logrus which writes to stderr by default.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR modifies retry logic infrastructure and adds standard Go unit tests (testing.T), not Ginkgo e2e tests. Check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptographic patterns found. Code uses only standard libraries for YAML parsing and test retry allowlisting; no crypto, hash, or TLS imports present.
Container-Privileges ✅ Passed PR does not contain K8s container manifests or security configurations. Changes are limited to test retry logic in Go code and a YAML list of eligible test names.
No-Sensitive-Data-In-Logs ✅ Passed All logging only exposes non-sensitive data: YAML errors, test names, retry metrics, and job metadata. No passwords, tokens, API keys, PII, or credentials are logged.
Title check ✅ Passed The title 'Restrict test retries to an allowlist of test names' accurately and concisely summarizes the main change: replacing image-tag-based retry eligibility with a YAML allowlist as the sole source of truth for test retry permissions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and f7fa00e.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_jobs.yaml

Comment thread pkg/test/ginkgo/retries_test.go Outdated
@mkowalski mkowalski changed the title Restrict test retries to an allowlist of CI jobs Restrict test retries to an allowlist of test names May 27, 2026
@mkowalski

Copy link
Copy Markdown
Contributor Author

@coderabbitai help

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai rate limit to show your current review rate limit status.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
  • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
  • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@mkowalski

Copy link
Copy Markdown
Contributor Author

@coderabbitai summary
@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

@mkowalski Sure! I'll regenerate the PR summary and kick off a full review of all the changes.

✅ Actions performed

Summary regeneration triggered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b

📥 Commits

Reviewing files that changed from the base of the PR and between f7fa00e and 0412516.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_tests.yaml

Comment thread pkg/test/ginkgo/retries.go
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026
@mkowalski

Copy link
Copy Markdown
Contributor Author

/cc @dgoodwin

@openshift-ci openshift-ci Bot requested a review from dgoodwin May 27, 2026 14:18
@dgoodwin

Copy link
Copy Markdown
Contributor

The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes.

I suggest this query:

  SELECT
    rs.TestName,
    COUNT(*) AS flake_count,
    COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
    SUM(rs.TotalAttempts) AS total_attempts,
    SUM(rs.FailedAttempts) AS total_failed_attempts
  FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
  INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
    ON rs.JobName = jv.job_name
    AND jv.variant_name = 'Release'
    AND jv.variant_value = '4.22'
  WHERE rs.FinalOutcome = 'flaky'
    AND rs.TestName IS NOT NULL AND rs.TestName != ''
    AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY rs.TestName
  HAVING COUNT(*) >= 5
  ORDER BY flake_count DESC

We're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable.

If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests.

Threshold options from claude:

┌───────────┬────────────┬────────────┬──────────────────────┐
│ Threshold │ Tests kept │ % of tests │ Flake events covered │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 2 │ 395 │ 81% │ 99.0% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 3 │ 316 │ 65% │ 97.3% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 5 │ 207 │ 42.5% │ 93.2% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 10 │ 113 │ 23.2% │ 86.6% │
└───────────┴────────────┴────────────┴──────────────────────┘

We could go lower though, 2-3 would also be reasonable IMO.


func NewRetryOnceStrategy() *RetryOnceStrategy {
return &RetryOnceStrategy{
PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.

"[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-57658
"[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61287
"[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61734
"[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if all these are in the list with the new query I added in the PR, if not we should add them.

Also could you ensure the list is sorted, the query I provided isn't. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list is now sorted. The "migrated" whitelisted tests are there. After running your revised query we are down to ~200 tests only.

@dgoodwin

Copy link
Copy Markdown
Contributor

Code looks good, just a revision to the query and some sorting concerns.

Replace the image-tag-based and hardcoded retry eligibility logic in
shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml)
that is the sole source of truth for which tests may retry on failure.

Previously, any test from a permitted image tag could retry, and 6
specific tests had hardcoded exceptions. Now, a test must be explicitly
listed in the YAML file to get a retry. Tests not on the list must pass
on the first attempt.

The allowlist is populated from BigQuery retry_statistics data: all 1395
tests that historically failed first but passed on retry are included.
The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665,
OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered
by the BigQuery data and no longer need special treatment in Go code.

The list should be reduced over time as flaky tests are fixed. To
disable retries entirely, empty the YAML list.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/test/ginkgo/retry_allowed_tests.yaml (1)

8-23: ⚡ Quick win

Consider enriching the documented query with additional metrics.

The BigQuery query documents how to regenerate the allowlist, but it only returns TestName and flake_count. Based on the PR comments, a more informative query would also include distinct_jobs_affected, total_attempts, and total_failed_attempts to help maintainers make better decisions when regenerating the list.

📊 Suggested enhanced query
SELECT
  rs.TestName,
  COUNT(*) AS flake_count,
  COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
  SUM(rs.TotalAttempts) AS total_attempts,
  SUM(rs.FailedAttempts) AS total_failed_attempts
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
  ON rs.JobName = jv.job_name
  AND jv.variant_name = 'Release'
  AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
  AND rs.TestName IS NOT NULL AND rs.TestName != ''
  AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC

This provides better visibility into the scope and impact of each flaky test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/test/ginkgo/retry_allowed_tests.yaml` around lines 8 - 23, Update the
documented BigQuery in pkg/test/ginkgo/retry_allowed_tests.yaml (the query that
selects from retry_statistics aliased as rs and joins job_variants as jv) to
include the additional metrics: COUNT(DISTINCT rs.JobName) AS
distinct_jobs_affected, SUM(rs.TotalAttempts) AS total_attempts, and
SUM(rs.FailedAttempts) AS total_failed_attempts; keep the existing WHERE, JOIN,
GROUP BY rs.TestName, HAVING COUNT(*) >= 5 and ordering by flake_count so the
query returns TestName, flake_count plus the three new aggregate columns to
provide better context for maintainers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/test/ginkgo/retry_allowed_tests.yaml`:
- Around line 8-23: Update the documented BigQuery in
pkg/test/ginkgo/retry_allowed_tests.yaml (the query that selects from
retry_statistics aliased as rs and joins job_variants as jv) to include the
additional metrics: COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
SUM(rs.TotalAttempts) AS total_attempts, and SUM(rs.FailedAttempts) AS
total_failed_attempts; keep the existing WHERE, JOIN, GROUP BY rs.TestName,
HAVING COUNT(*) >= 5 and ordering by flake_count so the query returns TestName,
flake_count plus the three new aggregate columns to provide better context for
maintainers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 89a30401-3429-4f1c-8564-5069943c9079

📥 Commits

Reviewing files that changed from the base of the PR and between 0412516 and 8b792da.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_tests.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retries.go

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@dgoodwin

Copy link
Copy Markdown
Contributor

/lgtm

Nice when one of our strategic initiatives is closed out in a couple hours and one PR :) Thanks!

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026
@openshift-ci

openshift-ci Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mkowalski mkowalski changed the title Restrict test retries to an allowlist of test names NO-ISSUE: Restrict test retries to an allowlist of test names May 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mkowalski: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

  • Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.
  • Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.
  • The allowlist is populated from BigQuery retry_statistics data, scoped to release 4.22, last 30 days, with a threshold of >= 5 flake occurrences. This yields 206 tests.

Behavior

Scenario Retries?
Test listed in the YAML allowlist Yes
Test NOT in the allowlist No
Empty allowlist (tests: []) No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:

tests:
 - "[sig-network] Services should be rejected for evicted pods ..."
 - "[sig-node] Pods Extended Pod Container lifecycle ..."

The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery

SELECT
 rs.TestName,
 COUNT(*) AS flake_count
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
 ON rs.JobName = jv.job_name
 AND jv.variant_name = 'Release'
 AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
 AND rs.TestName IS NOT NULL AND rs.TestName != ''
 AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC

The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

  • Tests
  • Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a single retry.
  • Added tests verifying allowlist-driven retry decisions and confirming the allowlist loader returns a valid result.
  • Retry decisions now emit debug logs for allow/deny outcomes and default to no retries if the allowlist cannot be parsed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mkowalski

Copy link
Copy Markdown
Contributor Author

/test e2e-metal-ipi-ovn-ipv6

@mkowalski

Copy link
Copy Markdown
Contributor Author

/verified by later

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 29, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mkowalski: This PR has been marked as verified by later.

Details

In response to this:

/verified by later

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 732285e into openshift:main May 29, 2026
21 checks passed
var config retryAllowedTestsConfig
if err := yaml.Unmarshal(retryAllowedTestsYAML, &config); err != nil {
logrus.WithError(err).Error("Failed to parse retry_allowed_tests.yaml, no tests will be allowed to retry")
return map[string]bool{}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should've used the kube set type, FWIW

APIGroups sets.Set[string] `json:"-"`

@stbenjam

Copy link
Copy Markdown
Member

Sorry I only got around to looking at this now, logic looks correct to me 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants